Abstract. In the context of sparse principal component detection, we
bring evidence towards the existence of a statistical price to pay for
computational efficiency. We measure the performance of a test by the
smallest signal strength that it can detect and we propose a
computationally efficient method based on semidefinite programming. We also
prove that the statistical performance of this test cannot be strictly
improved by any computationally efficient method. Our results can be
viewed as complexity theoretic lower bounds conditionally on the
assumptions that some instances of the planted clique problem cannot be
solved in randomized polynomial time.
1. INTRODUCTION

The modern scientific landscape has been significantly transformed over the
past decade by the emergence of massive datasets. From the statistical learning
point of view, this transformation has led to a paradigm shift. Indeed, most novel
methods consist in searching for sparse structure in datasets, whereas estimating 
parameters over this structure is now a fairly well understood problem. It turns 
out that most interesting structures have a combinatorial nature, often leading 
to computationally hard problems. This has led researchers to consider various 
numerical tricks, chiefly convex relaxations, to overcome this issue. While these 
new questions have led to fascinating interactions between learning and optimiza- 
tion, they do not always come with satisfactory answers from a statistical point 
of view. The main purpose of this paper is to study one example, namely sparse 
principal component detection, for which current notions of statistical optimality 
should also be shifted, along with the paradigm. 

Sparse detection problems, where one wants to detect the presence of a sparse
structure in noisy data, fall in this line of work. There has been recent
interest in detection problems of the form signal-plus-noise, where the signal is a
vector with combinatorial structure [ABBDL10, ACCP11, ACV13] or even a
matrix [BI13, SN13, KBRS11, BKR+11]. The matrix detection problem was pushed
beyond the signal-plus-noise model towards more complicated dependence
structures [ACBL12, ACBL13, BR12]. One contribution of this paper is to extend
these results to more general distributions.

For matrix problems, and in particular sparse principal component (PC) detec- 
tion, some computationally efficient methods have been proposed, but they are 
not proven to achieve the optimal detection levels. [JL09, CMW12, Ma13] suggest
heuristics for which detection levels are unknown and [BR12] prove suboptimal 
detection levels for a natural semidefinite relaxation developed in [dGJL07] and an 
even simpler, efficient, dual method called Minimum Dual Perturbation (MDP). 
More recently, [dBG12] developed another semidefinite relaxation for sparse PC 
detection that performs well only outside of the high-dimensional, low sparsity 
regime that we are interested in. Note that it follows from the results of [AW09] 
that the former semidefinite relaxation is optimal if it has a rank-one solution. 
Unfortunately, rank-one solutions can only be guaranteed at suboptimal detec- 
tion levels. This literature hints at a potential cost for computational efficiency 
in the sparse PC detection problem. 

Partial results were obtained in [BR12], who proved that their bounds for MDP
and SDP are unlikely to be improved, as otherwise they would lead to randomized
polynomial time algorithms for instances of the planted clique problem that are
believed to be hard. This result only focuses on a given testing method, but
suggests the existence of an intrinsic gap between the optimal rates of detection and
what is statistically achievable in polynomial time. Such phenomena are hinted
at in [CJ13], but these results focus on the behavior of upper bounds. Closer
to our goal is [SSST12], which exhibits a statistical price to pay for computational
efficiency. In particular, they derive a computational theoretic lower bound using
a much weaker conjecture than the hidden clique conjecture that we employ here,
namely the existence of one-way permutations. This conjecture is widely accepted
and is the basis of many cryptographic protocols. Unfortunately, the lower bound 
holds only for a synthetic classification problem that is somewhat tailored to this 
conjecture. It still remains to fully describe a theory, and to develop lower bounds 
on the statistical accuracy that is achievable in reasonable computational time 
for natural problems. This article aims to do so for a general sparse PC detection 
problem. 

This paper is organized in the following way. The sparse PC detection problem 
is formally described in Section 2. Then, we show in Section 3 that our general 
detection framework is a natural extension of the existing literature, and that 
all the usual results for classical detection of sparse PC are still valid. Section 4 
focuses on testing in polynomial time, where we study detection levels for the 
semidefinite relaxation developed in [dGJL07] (it trivially extends to the MDP
statistic of [BR12]). These levels are shown to be unimprovable using
computationally efficient methods in Section 5. This is achieved by introducing a new
notion of optimality that takes into account computational efficiency. Practically,
we reduce the planted clique problem, conjectured to be computationally hard
already in an average-case sense (i.e., over most random instances), to obtaining
better rates for sparse PC detection.

Notation. The space of $d \times d$ symmetric real matrices is denoted by $\mathbf{S}_d$. We
write $Z \succeq 0$ whenever $Z$ is positive semidefinite. We denote by $\mathbb{N}$ the set of
nonnegative integers and define $\mathbb{N}_1 = \mathbb{N} \setminus \{0\}$.
The elements of a vector $v \in \mathbb{R}^d$ are denoted by $v_1, \ldots, v_d$ and similarly,
a matrix $Z$ has element $Z_{ij}$ on its $i$th row and $j$th column. For any $q > 0$,
$|v|_q$ denotes the $\ell_q$ "norm" of a vector $v$ and is defined by $|v|_q = (\sum_j |v_j|^q)^{1/q}$.
Moreover, we denote by $|v|_0$ its so-called $\ell_0$ "norm", that is its number of nonzero
elements. Furthermore, by extension, for $Z \in \mathbf{S}_d$, we denote by $|Z|_q$ the $\ell_q$ norm
of the vector formed by the entries of $Z$. We also define for $q \in [0, 2)$ the set
$B_q(R)$ of unit vectors within the $\ell_q$-ball of radius $R > 0$

$$B_q(R) = \{v \in \mathbb{R}^d : |v|_2 = 1,\ |v|_q \le R\}.$$

For a finite set $S$, we denote by $|S|$ its cardinality. We also write $A_S$ for the
$|S| \times |S|$ submatrix with elements $(A_{ij})_{i,j \in S}$, and $v_S$ for the vector of $\mathbb{R}^{|S|}$ with
elements $v_i$ for $i \in S$. The vector $\mathbf{1}$ denotes a vector with coordinates all equal to
1. If a vector has an index such as $v_i$, then we use $v_{i,j}$ to denote its $j$th element.

The vectors $e_i$ and matrices $E_{ij}$ are the elements of the canonical bases of $\mathbb{R}^d$
and $\mathbb{R}^{d \times d}$. We also define $\mathcal{S}^{d-1}$ as the unit Euclidean sphere of $\mathbb{R}^d$ and $\mathcal{S}_S^{d-1}$ the
set of vectors in $\mathcal{S}^{d-1}$ with support $S \subset \{1, \ldots, d\}$. The identity matrix in $\mathbb{R}^d$ is
denoted by $I_d$.

A Bernoulli random variable with parameter $p \in [0, 1]$ takes values 1 or 0 with
probability $p$ and $1 - p$ respectively. A Rademacher random variable takes values 1
or —1 with probability 1/2. A binomial random variable, with distribution B(n,p) 
is the sum of n independent Bernoulli random variables with identical parameter 
p. A hypergeometric random variable, with distribution H(N, k, n) is the random 
number of successes in n draws from a population of size N among which are k 
successes, without replacement. The total variation norm, noted || • ||tv has the 
usual definition. 

The trace and rank functionals are denoted by Tr and rank respectively and
have their usual definition. We denote by $T^c$ the complement of a set $T$. Finally,
for two real numbers $a$ and $b$, we write $a \wedge b = \min(a, b)$, $a \vee b = \max(a, b)$, and
$a_+ = a \vee 0$.

2. PROBLEM DESCRIPTION 

Let $X \in \mathbb{R}^d$ be a centered random vector with unknown distribution $P$ that has
finite second moment along every direction. The first principal component for $X$
is a direction $v \in \mathcal{S}^{d-1}$ such that the variance $V(v) = \mathbb{E}[(v^\top X)^2]$ along direction
$v$ is larger than in any other direction. If no such $v$ exists, the distribution of
$X$ is said to be isotropic. The goal of sparse principal component detection is
to test whether $X$ follows an isotropic distribution $P_0$ or a distribution $P_v$ for
which there exists a sparse $v \in B_0(k)$, $k \ll d$, along which the variance is large.
Without loss of generality, we assume that under the isotropic distribution $P_0$,
all directions have unit variance and under $P_v$, the variance along $v$ is equal to
$1 + \theta$ for some positive $\theta$. Note that since $v$ has unit norm, $\theta$ captures the signal
strength.

To perform our test, we observe $n$ independent copies $X_1, \ldots, X_n$ of $X$. For
any direction $u \in \mathcal{S}^{d-1}$, define the empirical variance along $u$ by

$$\hat V_n(u) = \frac{1}{n} \sum_{i=1}^n (u^\top X_i)^2.$$
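As a concrete reading of this definition, the following minimal Python sketch computes the empirical variance along a direction for a small synthetic sample. The function name `empirical_variance` and the Rademacher toy data are illustrative assumptions, not part of the paper.

```python
import math
import random

def empirical_variance(samples, u):
    """Return (1/n) * sum_i (u^T X_i)^2 for a list of d-dimensional samples."""
    n = len(samples)
    return sum(sum(ui * xi for ui, xi in zip(u, x)) ** 2 for x in samples) / n

# Toy isotropic data (an assumption for illustration): n vectors of d
# independent Rademacher coordinates, so every unit direction has variance 1.
rng = random.Random(0)
d, n = 5, 2000
samples = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]

u = [1.0 / math.sqrt(d)] * d            # a dense unit direction
print(empirical_variance(samples, u))   # fluctuates around 1 under isotropy
```

Along a coordinate axis the Rademacher example gives exactly 1, whereas along a dense direction the value only concentrates around 1, which is the kind of fluctuation the concentration conditions are meant to control.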
Clearly the concentration of $\hat V_n(u)$ around $V(u)$ will have a significant effect on the
performance of our testing procedure. If, for any $u \in \mathcal{S}^{d-1}$, the centered random
variable $(u^\top X)^2 - \mathbb{E}[(u^\top X)^2]$ satisfies the conditions for Bernstein's inequality
(see, e.g., [Mas07], eq. (2.18), p. 24) under both $P_0$ and $P_v$, then, up to numerical
constants, we have

$$\sup_{u \in \mathcal{S}^{d-1}} P_0^{\otimes n}\Big( |\hat V_n(u) - 1| > 4\sqrt{\frac{\log(2/\nu)}{n}} + 4\,\frac{\log(2/\nu)}{n} \Big) \le \nu, \quad \forall\, \nu > 0, \tag{1}$$

$$P_v^{\otimes n}\Big( \hat V_n(v) - (1 + \theta) < -2\sqrt{\frac{\theta k \log(2/\nu)}{n}} - 4\,\frac{\log(2/\nu)}{n} \Big) \le \nu, \quad \forall\, \nu > 0,\ v \in B_0(k). \tag{2}$$

Such inequalities are satisfied if we assume that $P_0$ and $P_v$ are sub-Gaussian
distributions, for example. Rather than specifying such an ad-hoc assumption,
we define the following sets of distributions under which the fluctuations of $\hat V_n$
around $V$ are of the same order as those of sub-Gaussian distributions. As a result,
we formulate our testing problem on the unknown distribution $P$ of $X$ as follows

$$H_0 : P \in \mathcal{D}_0 = \{P_0 : (1) \text{ holds}\}$$

$$H_1 : P \in \mathcal{D}_1^k(\theta) = \bigcup_{v \in B_0(k)} \{P_v : (2) \text{ holds}\}.$$

Note that distributions in $\mathcal{D}_0$ and $\mathcal{D}_1^k(\theta)$ are implicitly centered at zero.

We argue that interesting testing procedures should be robust and thus perform 
well uniformly over these distributions. In the rest of the paper, we focus on such 
procedures. The existing literature on sparse principal component testing, partic- 
ularly in [BR12] and [ACBL12] focuses on multivariate normal distributions, yet 
only relies on the sub-Gaussian properties of the empirical variance along unit 
directions. Actually, all the distributional assumptions made in [VL12, ACBL12] 
and [BR12] are particular cases of these hypotheses. We will show that concen- 
tration of the empirical variance as in (1) and (2) is sufficient to derive the results 
that were obtained under the sub-Gaussian assumption. 

Recall that a test for this problem is a family $\psi = \{\psi_{d,n,k}\}$ of $\{0, 1\}$-valued
measurable functions of the data $(X_1, \ldots, X_n)$. Our goal is to quantify the
smallest signal strength $\theta > 0$ for which there exists a test $\psi$ with maximum test error
bounded by $\delta > 0$, i.e.,

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\psi = 1) \vee P_1^{\otimes n}(\psi = 0) \big\} \le \delta.$$

To call our problem "sparse", we need to assume somehow that $k$ is rather small.
Throughout the paper, we fix a tolerance $0 < \delta < 1/3$ (e.g., $\delta = 5\%$) and focus
on the case where the parameters are in the sparse regime $R_0 \subset \mathbb{N}_1^3$ of positive
integers defined by

$$R_0 = \Big\{ (d, n, k) \in \mathbb{N}_1^3 : 15\sqrt{\frac{k \log(6ed/(k\delta))}{n}} \le 1,\ k \le d^{0.49} \Big\}.$$

Note that the constant 0.49 is arbitrary and can be replaced by any constant
$C < 0.5$.
Definition 1. Fix a set of parameters $R \subset R_0$ in the sparse regime. Let $\mathcal{T}$ be
a set of tests. A function $\theta^*$ of $(d, n, k) \in R$ is called optimal rate of detection
over the class $\mathcal{T}$ if for any $(d, n, k) \in R$, it holds:

(i) there exists a test $\psi \in \mathcal{T}$ that discriminates between $H_0$ and $H_1$ at level $c\theta^*$
for some constant $c > 0$, i.e., for any $\theta \ge c\theta^*$

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\psi = 1) \vee P_1^{\otimes n}(\psi = 0) \big\} \le \delta.$$

In this case we say that $\psi \in \mathcal{T}$ discriminates between $H_0$ and $H_1$ at rate $\theta^*$.
(ii) for any test $\phi \in \mathcal{T}$, there exists a constant $c_\phi > 0$ such that $\theta \le c_\phi \theta^*$ implies

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\phi = 1) \vee P_1^{\otimes n}(\phi = 0) \big\} \ge \delta.$$

Moreover, if both (i) and (ii) hold, we say that $\psi$ is an optimal test over the class
$\mathcal{T}$.

This is an adaptation of the usual notion of statistical optimality, when one is
focusing on the class of measurable functions $\psi_{d,n,k} : (X_1, \ldots, X_n) \mapsto \{0, 1\}$,
also known as minimax optimality [Tsy09]. In order to take into account the
asymptotic nature of some classes of statistical tests (namely, those that are
computationally efficient), we allow the constant $c_\phi$ in (ii) to depend on the test.

3. STATISTICALLY OPTIMAL TESTING 

We focus first on the traditional setting where $\mathcal{T}$ contains all sequences $\{\psi_{d,n,k}\}$
of tests.

Denote by $\Sigma = \mathbb{E}[XX^\top]$ the covariance matrix of $X$ and by $\hat\Sigma$ its empirical
counterpart:

$$\hat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top. \tag{3}$$

Observe that $V(u) = u^\top \Sigma u$ and $\hat V_n(u) = u^\top \hat\Sigma u$, for any $u \in \mathcal{S}^{d-1}$. Maximizing
$\hat V_n(u)$ over $B_0(k)$ gives the largest empirical variance along any $k$-sparse direction.
It is also known as the $k$-sparse eigenvalue of $\hat\Sigma$ defined by

$$\lambda^k_{\max}(\hat\Sigma) = \max_{u \in B_0(k)} u^\top \hat\Sigma u. \tag{4}$$

The following theorem describes the performance of the test

$$\psi_{d,n,k} = \mathbf{1}\{\lambda^k_{\max}(\hat\Sigma) > 1 + \tau\}, \quad \tau > 0. \tag{5}$$
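Theorem 2. Assume that $(d, n, k) \in R_0$ and define

$$\bar\theta = 15\sqrt{\frac{k \log(6ed/(k\delta))}{n}}.$$

Then, for $\theta \in [\bar\theta, 1]$, the test $\psi$ defined in (5) with threshold $\tau = 8\sqrt{k \log(6ed/(k\delta))/n}$
satisfies

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\psi = 1) \vee P_1^{\otimes n}(\psi = 0) \big\} \le \delta.$$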

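Proof. Define $\tau_1 = 7\sqrt{k \log(2/\delta)/n}$. For $P_1 \in \mathcal{D}_1^k(\theta)$, by (2), and for $P_0 \in \mathcal{D}_0$,
using Lemma 10, we get

$$P_0^{\otimes n}\big(\lambda^k_{\max}(\hat\Sigma) > 1 + \tau\big) \le \delta, \qquad P_1^{\otimes n}\big(\lambda^k_{\max}(\hat\Sigma) < 1 + \theta - \tau_1\big) \le \delta.$$

To conclude the proof, observe that $\tau \le \bar\theta - \tau_1 \le \theta - \tau_1$. □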
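For very small $d$ and $k$, the statistic in (4) and the test in (5) can be evaluated by exhaustive search over supports, which makes the combinatorial cost explicit. The sketch below is an illustration only: the helper names, the power-iteration eigenvalue routine and the enumeration over all size-$k$ supports are assumptions of this example, and the search is exponential in general.

```python
import itertools
import math

def empirical_covariance(samples):
    """Return the d x d matrix (1/n) * sum_i X_i X_i^T as nested lists."""
    n, d = len(samples), len(samples[0])
    return [[sum(x[i] * x[j] for x in samples) / n for j in range(d)]
            for i in range(d)]

def largest_eigenvalue(M, iters=500):
    """Power iteration for a small positive semidefinite matrix.

    Assumes the starting vector is not orthogonal to the top eigenvector.
    """
    size = len(M)
    v = [1.0 + 0.01 * i for i in range(size)]   # generic starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        # Rayleigh quotient of the current unit vector.
        lam = sum(v[i] * M[i][j] * v[j] for i in range(size) for j in range(size))
    return lam

def k_sparse_eigenvalue(Sigma, k):
    """Maximum of u^T Sigma u over unit vectors with at most k nonzero entries."""
    d = len(Sigma)
    return max(
        largest_eigenvalue([[Sigma[i][j] for j in S] for i in S])
        for S in itertools.combinations(range(d), k)
    )

def sparse_pc_test(samples, k, tau):
    """Return 1 when the k-sparse eigenvalue exceeds 1 + tau, else 0."""
    return int(k_sparse_eigenvalue(empirical_covariance(samples), k) > 1.0 + tau)
```

Restricting to supports of size exactly $k$ loses nothing here, because the largest eigenvalue of a principal submatrix can only grow when the support is enlarged.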

The following lower bound follows directly from [BR12], Theorem 5.1 and holds 
already for Gaussian distributions. 

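Theorem 3. For all $\varepsilon > 0$, there exists a constant $C_\varepsilon > 0$ such that if

$$\theta < \sqrt{\frac{k \log(C_\varepsilon d/k^2 + 1)}{n}},$$

any test $\phi$ satisfies

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\phi = 1) \vee P_1^{\otimes n}(\phi = 0) \big\} \ge \frac{1}{2} - \varepsilon.$$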
Theorems 2 and 3 imply the following result.

Corollary 4. The sequence

$$\theta^* = \sqrt{\frac{k \log d}{n}}, \quad (d, n, k) \in R_0,$$

is the optimal rate of detection over the class of all tests.

4. POLYNOMIAL TIME TESTING 

It is not hard to prove that approximating $\lambda^k_{\max}(A)$ up to a factor of $m^{1-\varepsilon}$, $\varepsilon > 0$,
for any symmetric matrix $A$ of size $m \times m$ and any $k \in \{1, \ldots, m\}$ is NP-hard,
by a trivial reduction to CLIQUE (see [Has96, Has99, Zuc06] for hardness
of approximation of CLIQUE). Yet, our problem is not worst case and we need
not consider any matrix $A$. Rather, here, $A$ is a random matrix and we cannot
directly apply the above results.

In this section, we look for a test with good statistical properties that can
be computed in polynomial time. Indeed, finding efficient statistical methods in
high dimension is critical. Specifically, we study a test based on a natural convex
(semidefinite) relaxation of $\lambda^k_{\max}(\hat\Sigma)$ developed in [dGJL07].

For any $A \succeq 0$ let $\mathrm{SDP}_k(A)$ be defined as the optimal value of the following
semidefinite program:

$$\mathrm{SDP}_k(A) = \max.\ \mathrm{Tr}(AZ) \quad \text{subject to } \mathrm{Tr}(Z) = 1,\ |Z|_1 \le k,\ Z \succeq 0. \tag{6}$$
This optimization problem can be reformulated as a semidefinite program in its
canonical form with a polynomial number of constraints and can therefore be
solved in polynomial time up to arbitrary precision using interior point methods,
for example [BV04]. Indeed, we can write

$$\mathrm{SDP}_k(A) = \max.\ \sum_{i,j} A_{ij}(z^+_{ij} - z^-_{ij})$$

subject to

$$z^+_{ij} = z^+_{ji} \ge 0, \quad z^-_{ij} = z^-_{ji} \ge 0,$$

$$\sum_{i} (z^+_{ii} - z^-_{ii}) = 1, \quad \sum_{i,j} (z^+_{ij} + z^-_{ij}) \le k,$$

$$\sum_{i > j} (z^+_{ij} - z^-_{ij})(E_{ij} + E_{ji}) + \sum_{\ell} (z^+_{\ell\ell} - z^-_{\ell\ell}) E_{\ell\ell} \succeq 0.$$

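Consider the following test

$$\psi_{d,n,k} = \mathbf{1}\{\mathrm{SDP}^{(n)}_k(\hat\Sigma) > 1 + \tilde\tau\}, \quad \tilde\tau > 0, \tag{7}$$

where $\mathrm{SDP}^{(n)}_k$ is a $1/\sqrt{n}$-approximation of $\mathrm{SDP}_k$. [BAd10] show that $\mathrm{SDP}^{(n)}_k$ can
be computed in $O(k d^3 \sqrt{n \log d})$ elementary operations and thus in polynomial
time.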

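Theorem 5. Assume that $(d, n, k)$ are such that

$$\tilde\theta = 23\sqrt{\frac{k^2 \log(4d^2/\delta)}{n}} \le 1.$$

Then, for $\theta \in [\tilde\theta, 1]$, the test $\psi$ defined in (7) with threshold
$\tilde\tau = 16\sqrt{k^2 \log(4d^2/\delta)/n} + 1/\sqrt{n}$ satisfies

$$\sup_{P_0 \in \mathcal{D}_0,\ P_1 \in \mathcal{D}_1^k(\theta)} \big\{ P_0^{\otimes n}(\psi = 1) \vee P_1^{\otimes n}(\psi = 0) \big\} \le \delta.$$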



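Proof. Define

$$\tau = 16\sqrt{\frac{k^2 \log(4d^2/\delta)}{n}}, \qquad \tau_1 = 7\sqrt{\frac{k \log(4/\delta)}{n}}.$$

For all $\delta > 0$, $P_0 \in \mathcal{D}_0$, $P_1 \in \mathcal{D}_1^k(\theta)$, by Lemma 11 and Lemma 10, since
$\mathrm{SDP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma)$, it holds

$$P_0^{\otimes n}\big(\mathrm{SDP}_k(\hat\Sigma) \ge 1 + \tau\big) \le \delta, \qquad P_1^{\otimes n}\big(\mathrm{SDP}_k(\hat\Sigma) \le 1 + \theta - \tau_1\big) \le \delta.$$

Recall that $|\mathrm{SDP}^{(n)}_k - \mathrm{SDP}_k| \le 1/\sqrt{n}$ and observe that $\tau + 1/\sqrt{n} = \tilde\tau \le \tilde\theta - \tau_1 \le \theta - \tau_1$. □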

The size of the detection threshold $\tilde\theta$ is consistent with the results of [AW09,
BR12] for Gaussian distributions.

Clearly, this theorem, together with Theorem 3, indicates that the test based
on SDP may be suboptimal within the class of all tests. However, as we will
see in the next section, it can be proved to be optimal in a restricted class of
computationally efficient tests.
5. COMPLEXITY THEORETIC LOWER BOUNDS

It is legitimate to wonder if the upper bound in Theorem 5 is tight. Can faster 
rates be achieved by this method, or by other, possibly randomized, polynomial 
time testing methods? Or instead, is this gap intrinsic to the problem? A partial 
answer to this question is provided in [BR12], where it is proved that the test
defined in (7) cannot discriminate at a level significantly lower than $\tilde\theta$. Indeed, such
a test could otherwise be used to solve instances of the planted clique problem 
that are believed to be hard. This result is supported by some numerical evidence 
as well. 

In this section, we show that it is true not only of the test based on SDP but 
of any test computable in randomized polynomial time. 

5.1 Lower bounds and polynomial time reductions 

The upper bound of Theorem 5, if tight, seems to indicate that there is a gap 
between the detection levels that can be achieved by any test, and those that can 
be achieved by methods that run in polynomial time. In other words, it indicates a 
potential statistical cost for computational efficiency. To study this phenomenon, 
we take the approach favored in theoretical computer science, where our primary 
goal is to classify problems, rather than algorithms, according to their compu- 
tational hardness. Indeed, this approach is better aligned with our definition of 
optimal rate of detection, where lower bounds should hold for any test.
Unfortunately, it is difficult to derive a lower bound on the performance of any candidate
algorithm to solve a given problem. Rather, theoretical computer scientists have 
developed reductions from problem A to problem B with the following conse- 
quence: if problem B can be solved in polynomial time, then so can problem A. 
Therefore, if problem A is believed to be hard then so is problem B. Note that 
our reduction requires extra bits of randomness and is therefore a randomized 
polynomial time reduction. 

This question needs to be formulated from a statistical detection point of
view. As mentioned above, $\lambda^k_{\max}$ can be proved to be NP-hard to approximate.
Nevertheless, such worst case results are not sufficient to prove negative results
on our average case problem. Indeed, the matrix $\hat\Sigma$ is random and we only need
to be able to approximate $\lambda^k_{\max}(\hat\Sigma)$ up to a constant factor on most realizations.
In some cases, this small nuance can make a huge difference, as problems can be
hard in the worst case but easy on average (see, e.g., [Bop87] for an illustration
on Graph Bisection). In order to prove a complexity theoretic lower bound on the
sparse principal component detection problem, we will build a reduction from a
notoriously hard detection problem: the planted clique problem.

5.2 The Planted Clique problem 

Fix an integer $m \ge 2$ and let $\mathbb{G}_m$ denote the set of undirected graphs on $m$
vertices. Denote by $\mathcal{G}(m, 1/2)$ the distribution over $\mathbb{G}_m$ generated by choosing
to connect every pair of vertices by an edge independently with probability 1/2.
For any $\kappa \in \{2, \ldots, m\}$, the distribution $\mathcal{G}(m, 1/2, \kappa)$ is constructed by picking
$\kappa$ vertices arbitrarily and placing a clique^1 between them, then connecting every
other pair of vertices by an edge independently with probability 1/2. Note that
$\mathcal{G}(m, 1/2)$ is simply the distribution of an Erdős-Rényi random graph. In the
decision version of this problem, called Planted Clique, one is given a graph $G$ on
$m$ vertices and the goal is to detect the presence of a planted clique.

1 A clique is a subset of fully connected vertices.

Definition 6. Fix $m \ge \kappa \ge 2$. Let Planted Clique denote the following
statistical hypothesis testing problem:

$$H_0^{PC} : G \sim \mathcal{G}(m, 1/2) = P_0^{(G)}$$

$$H_1^{PC} : G \sim \mathcal{G}(m, 1/2, \kappa) = P_1^{(G)}.$$

A test for the planted clique problem is a family $\xi = \{\xi_{m,\kappa}\}$, where $\xi_{m,\kappa} : \mathbb{G}_m \to \{0, 1\}$.

The search version of this problem [Jer92, Kuc95] consists in finding the clique
planted under $H_1^{PC}$. The decision version that we consider here is traditionally
attributed to Saks [KV02, HK11]. It is known [Spe94] that if $\kappa > 2\log_2(m)$, the
planted clique is the only clique of size $\kappa$ in the graph, asymptotically almost
surely (a.a.s.). Therefore, a test based on the largest clique of $G$ makes it possible to
distinguish $H_0^{PC}$ and $H_1^{PC}$ for $\kappa > 2\log_2(m)$, a.a.s. This is clearly not a computationally
efficient test.
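To make the two hypotheses and the exhaustive clique test concrete, here is a small Python sketch. It samples a graph with or without a planted clique and computes the clique number by brute force, which is only feasible for tiny $m$. The function names are hypothetical, and choosing the planted vertices uniformly at random is an assumption of this example (the text only says they are picked arbitrarily).

```python
import random

def sample_graph(m, kappa=0, rng=None):
    """Sample edges with probability 1/2; plant a clique on kappa random vertices."""
    rng = rng or random.Random()
    adj = [set() for _ in range(m)]
    clique = set(rng.sample(range(m), kappa))
    for i in range(m):
        for j in range(i + 1, m):
            if (i in clique and j in clique) or rng.random() < 0.5:
                adj[i].add(j)
                adj[j].add(i)
    return adj, clique

def clique_number(adj):
    """Size of the largest clique, by exhaustive Bron-Kerbosch search (exponential time)."""
    best = 0

    def expand(size, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            best = max(best, size)      # the current clique is maximal
            return
        for v in list(candidates):
            expand(size + 1, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(0, set(range(len(adj))), set())
    return best

def clique_test(adj, kappa):
    """Return 1 (declare a planted clique) when some clique has size at least kappa."""
    return int(clique_number(adj) >= kappa)

graph, planted = sample_graph(12, kappa=6, rng=random.Random(1))
print(clique_test(graph, 6))   # a planted clique of size 6 is always detected
```

The test never misses a planted clique; its error comes only from null graphs that happen to contain a clique of size $\kappa$, which is the event controlled by the $2\log_2(m)$ threshold.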

For $\kappa = o(\sqrt{m})$ there is no known polynomial time algorithm that solves this
problem. Polynomial time algorithms for the case $\kappa = C\sqrt{m}$ were first proposed in
[AKS98], and subsequently in [McS01, AV11, DGGP10, FR10, FK00]. It is widely
believed that there is no polynomial time algorithm that solves Planted Clique for
any $\kappa$ of order $m^c$ for some fixed positive $c < 1/2$. Recent research has been
focused on proving that certain algorithmic techniques, such as the Metropolis
process [Jer92] and the Lovász-Schrijver hierarchy of relaxations [FK03], fail at
this task. The confidence in the difficulty of this problem is so strong that it
has led researchers to prove impossibility results assuming that Planted Clique is
indeed hard. Examples include cryptographic applications in [JP00], testing for
$k$-wise dependence in [AAK+07], approximating Nash equilibria in [HK11] and
approximating solutions to the densest $\kappa$-subgraph problem in [AAM+11].

We therefore make the following assumption on the planted clique problem.
Recall that $\delta$ is a confidence level fixed throughout the paper.

Hypothesis $A_{PC}$. For any $a, b \in (0, 1)$, $a < b$ and all randomized polynomial time
tests $\xi = \{\xi_{m,\kappa}\}$, there exists a positive constant $\Gamma$ that may depend on $\xi, a, b$
and such that

$$P_0^{(G)}(\xi_{m,\kappa}(G) = 1) \vee P_1^{(G)}(\xi_{m,\kappa}(G) = 0) \ge 1.2\delta, \quad \forall\, m^{a/2} < \Gamma\kappa < m^{b/2}.$$

Note that $1.2\delta < 1/2$ can be replaced by any constant arbitrarily close to 1/2.
Since $\kappa$ is polynomial in $m$, here a randomized polynomial time test is a test that
can be computed in time at most polynomial in $m$ and has access to extra bits of
randomness. The fact that $\Gamma$ may depend on $\xi$ is due to the asymptotic nature
of polynomial time algorithms. Below is an equivalent formulation of
Hypothesis $A_{PC}$.

Hypothesis $B_{PC}$. For any $a, b \in (0, 1)$, $a < b$ and all randomized polynomial time
tests $\xi = \{\xi_{m,\kappa}\}$, there exists $m_0 \ge 1$ that may depend on $\xi, a, b$ and such that

$$P_0^{(G)}(\xi_{m,\kappa}(G) = 1) \vee P_1^{(G)}(\xi_{m,\kappa}(G) = 0) \ge 1.2\delta, \quad \forall\, m^{a/2} < \kappa < m^{b/2},\ m \ge m_0.$$
Note that we do not specify a computational model intentionally. Indeed, for some
restricted computational models, Hypothesis $A_{PC}$ can be proved to be true for
all $a < b \in (0, 1)$ [Ros10, FGR+13]. Moreover, for more powerful computational
models such as Turing machines, this hypothesis is conjectured to be true. It was
shown in [BR12] that improving the detection level of the test based on SDP
would lead to a contradiction of Hypothesis $A_{PC}$ for some $b \in (2/3, 1)$. Hereafter,
we extend this result to all randomized polynomial time algorithms, not only
those based on SDP.

5.3 Randomized polynomial time reduction 

Our main result is based on a randomized polynomial time reduction of an
instance of the planted clique problem to an instance of the sparse PC detection
problem. In this section, we describe this reduction and call it the bottom-left
transformation. For any $\mu \in (0, 1)$, define

$$R_\mu = R_0 \cap \{k \ge n^\mu\} \cap \{n < d\}.$$

The condition $k \ge n^\mu$ is necessary since "polynomial time" is an intrinsically
asymptotic notion and for fixed $k$, computing $\lambda^k_{\max}$ takes polynomial time in
$n$. The condition $n < d$ is an artifact of our reduction and could potentially
be improved. Nevertheless, it characterizes the high-dimensional setup we are
interested in and allows us to shorten the presentation.

Given $(d, n, k) \in R_\mu$, fix integers $m, \kappa$ such that $n \le m < d$, $k \le \kappa \le m$
and let $G = (V, E) \in \mathbb{G}_{2m}$ be an instance of the planted clique problem with a
potential clique of size $\kappa$. We begin by extracting a bipartite graph as follows.
Choose $n$ right vertices $V_{\text{right}}$ at random among the $2m$ possible and choose
$m$ left vertices $V_{\text{left}}$ among the $2m - n$ vertices that are not in $V_{\text{right}}$. The
edges of this bipartite graph^2 are $E \cap \{V_{\text{left}} \times V_{\text{right}}\}$. Next, since $d > m$, add
$d - m \ge 1$ new left vertices and place an edge between each new left vertex and
every old right vertex independently with probability 1/2. Label the left (resp.
right) vertices using a random permutation of $\{1, \ldots, d\}$ (resp. $\{1, \ldots, n\}$) and
denote by $V = (\{1, \ldots, d\} \times \{1, \ldots, n\}, E)$ the resulting $d \times n$ bipartite graph.
Note that if $G$ has a planted clique of size $\kappa$, then $V$ has a planted biclique of
random size.

Let $B$ denote the $d \times n$ adjacency matrix of $V$ and let $\eta_1, \ldots, \eta_n$ be $n$ i.i.d.
Rademacher random variables that are independent of all previous random
variables. Define

$$X_i^{(G)} = \eta_i (2B_i - \mathbf{1}) \in \{-1, 1\}^d,$$

where $B_i$ denotes the $i$-th column of $B$. Put together, these steps define the
bottom-left transformation $\mathrm{bl} : \mathbb{G}_{2m} \to \mathbb{R}^{d \times n}$ of a graph $G$ by

$$\mathrm{bl}(G) = \big(X_1^{(G)}, \ldots, X_n^{(G)}\big) \in \mathbb{R}^{d \times n}. \tag{8}$$

Note that $\mathrm{bl}(G)$ can be constructed in randomized polynomial time in $d, n, k, \kappa, m$.
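The steps of the bottom-left transformation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the graph is given as a list of neighbour sets on $2m$ vertices, the function name `bottom_left` is hypothetical, and the output is returned as a list of $n$ vectors of length $d$ rather than as a $d \times n$ array.

```python
import random

def bottom_left(adj, d, n, rng):
    """Sketch of bl(G) for a graph on 2m vertices given as a list of neighbour sets."""
    two_m = len(adj)
    m = two_m // 2
    if not (n <= m < d):
        raise ValueError("the construction needs n <= m < d")
    vertices = list(range(two_m))
    rng.shuffle(vertices)
    right = vertices[:n]            # n right vertices chosen at random
    left = vertices[n:n + m]        # m left vertices among the remaining 2m - n
    # Bipartite adjacency: rows are left vertices, columns are right vertices.
    B = [[1 if r in adj[l] else 0 for r in right] for l in left]
    # d - m new left vertices, joined to each right vertex with probability 1/2.
    B += [[rng.randint(0, 1) for _ in range(n)] for _ in range(d - m)]
    rng.shuffle(B)                  # random relabelling of the left vertices
    cols = list(range(n))
    rng.shuffle(cols)               # random relabelling of the right vertices
    B = [[row[c] for c in cols] for row in B]
    eta = [rng.choice((-1, 1)) for _ in range(n)]   # i.i.d. Rademacher signs
    # X_i = eta_i * (2 * B_i - 1), where B_i is the i-th column of B.
    return [[eta[i] * (2 * B[j][i] - 1) for j in range(d)] for i in range(n)]

rng = random.Random(0)
m, d, n = 4, 6, 3
complete = [set(range(2 * m)) - {v} for v in range(2 * m)]   # toy input graph
X = bottom_left(complete, d, n, rng)
print(len(X), len(X[0]))
```

Every step is a shuffle, a lookup or a coin flip, so the cost is polynomial in $d$, $n$ and $m$, in line with the remark above.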



2 The "bottom-left" terminology comes from the fact that the adjacency matrix of this
bipartite graph can be obtained as the bottom-left corner of the original adjacency matrix after a
random permutation of the rows/columns.
5.4 Optimal detection over randomized polynomial time tests

For any $\alpha \in [1, 2]$, define the detection level $\theta_\alpha > 0$ by $\theta_\alpha = \sqrt{k^\alpha / n}$.
Up to logarithmic terms, it interpolates polynomially between the statistically
optimal detection level $\theta^*$ and the detection level $\tilde\theta$ that is achievable by the
polynomial time test based on SDP. We have $\theta^* = \theta_1 \sqrt{\log d}$ and $\tilde\theta = C\theta_2 \sqrt{\log d}$
for some positive constant $C$.
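A short numerical sketch can help to see the size of this gap. It assumes the reading $\theta_\alpha = \sqrt{k^\alpha / n}$, which is the form consistent with $\theta^* = \theta_1 \sqrt{\log d}$; the problem sizes below are hypothetical and chosen only for illustration.

```python
import math

def theta_alpha(k, n, alpha):
    """Detection level sqrt(k**alpha / n), assumed form for alpha in [1, 2]."""
    return math.sqrt(k ** alpha / n)

d, n, k = 10_000, 50_000, 20     # hypothetical problem sizes
log_factor = math.sqrt(math.log(d))

# alpha = 1 corresponds to the statistically optimal level (up to the log factor),
# alpha = 2 to the level of the SDP-based test (up to constants and the log factor).
for alpha in (1.0, 1.5, 2.0):
    print(alpha, theta_alpha(k, n, alpha) * log_factor)
```

The ratio between the two endpoints is $\theta_2 / \theta_1 = \sqrt{k}$, so the price of computational efficiency discussed here grows with the sparsity level $k$.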

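Theorem 7. Fix $\alpha \in [1, 2)$, $\mu \in (0, \frac{1}{4-\alpha})$ and define

$$a = 2\mu, \qquad b = 1 - (2 - \alpha)\mu. \tag{9}$$

For any $\Gamma > 0$, there exists a constant $L > 0$ such that the following holds. For
any $(d, n, k) \in R_\mu$, there exists $m, \kappa$ such that $(2m)^{a/2} \le \Gamma\kappa \le (2m)^{b/2}$, a random
transformation $\mathrm{bl} = \{\mathrm{bl}_{d,n,k,m,\kappa}\}$, $\mathrm{bl}_{d,n,k,m,\kappa} : \mathbb{G}_{2m} \to \mathbb{R}^{d \times n}$ that can be computed
in polynomial time and distributions $P_0 \in \mathcal{D}_0$, $P_1 \in \mathcal{D}_1^k(L\theta_\alpha)$ such that for any
test $\psi = \{\psi_{d,n,k}\}$, we have

$$P_0^{\otimes n}(\psi_{d,n,k} = 1) \vee P_1^{\otimes n}(\psi_{d,n,k} = 0) \ge P_0^{(G)}(\xi_{m,\kappa}(G) = 1) \vee P_1^{(G)}(\xi_{m,\kappa}(G) = 0) - \frac{\delta}{5},$$

where $\xi_{m,\kappa} = \psi_{d,n,k} \circ \mathrm{bl}_{d,n,k,m,\kappa}$.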

Proof. Fix $(d, n, k) \in R_\mu$, $\alpha \in [1, 2)$. First, if $G$ is an Erdős-Rényi graph,
$\mathrm{bl}(G) = (X_1^{(G)}, \ldots, X_n^{(G)})$ is an array of $n$ i.i.d. vectors of $d$ independent Rademacher
random variables. Therefore $X_1^{(G)} \sim P_0 \in \mathcal{D}_0$.

Second, if $G$ has a planted clique of size $\kappa$, let $P_1^{\mathrm{bl}(G)}$ denote the joint
distribution of $\mathrm{bl}(G)$. The choices of $\kappa$ and $m$ depend on the relative size of $k$ and $n$.
Our proof relies on the following lemma.

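Lemma 8. Fix $\beta > 0$ and integers $m, \kappa, n, k$ such that $1 \le n \le m$, $2 \le k \le \kappa \le m$,

$$(a)\ \frac{m}{n} \ge \frac{8}{\beta}, \qquad (b)\ \frac{n\kappa}{m} \ge 16 \log\frac{m}{n}, \qquad (c)\ \frac{n\kappa}{m} \ge 8k. \tag{10}$$

Moreover, define

$$\bar\theta = \frac{(k - 1)\kappa}{2m}.$$

Let $G \sim \mathcal{G}(2m, 1/2, \kappa)$ and $\mathrm{bl}(G) = (X_1^{(G)}, \ldots, X_n^{(G)}) \in \mathbb{R}^{d \times n}$ be defined in (8).
Denote by $P_1^{\mathrm{bl}(G)}$ the distribution of $\mathrm{bl}(G)$. Then, there exists a distribution
$P_1 \in \mathcal{D}_1^k(\bar\theta)$ such that

$$\big\| P_1^{\mathrm{bl}(G)} - P_1^{\otimes n} \big\|_{TV} \le \beta.$$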

Proof. Let $S \subset \{1, \ldots, n\}$ (resp. $T \subset \{1, \ldots, d\}$) denote the (random) right
(resp. left) vertices of $V$ that are in the planted biclique.
Define the random variables

$$\varepsilon'_i = \mathbf{1}\{i \in S\}, \quad i = 1, \ldots, n,$$
$$\gamma'_j = \mathbf{1}\{j \in T\}, \quad j = 1, \ldots, d.$$
On the one hand, if $i \notin S$, i.e., if $\varepsilon'_i = 0$, then $X_i^{(G)}$ is a vector of independent
Rademacher random variables. On the other hand, if $i \in S$, i.e., if $\varepsilon'_i = 1$ then,
for any $j = 1, \ldots, d$,

$$X_{i,j}^{(G)} = Y'_{i,j} = \begin{cases} \eta_i & \text{if } \gamma'_j = 1, \\ r_{ij} & \text{otherwise,} \end{cases}$$

where $r = \{r_{ij}\}_{ij}$ is an $n \times d$ matrix of i.i.d. Rademacher random variables.
We can therefore write

$$X_i^{(G)} = (1 - \varepsilon'_i) r_i + \varepsilon'_i Y'_i, \quad i = 1, \ldots, n,$$

where $Y'_i = (Y'_{i,1}, \ldots, Y'_{i,d})$ and $r_i$ is the $i$th row of $r$.

Note that the $\varepsilon'_i$s are not independent. Indeed, they correspond to $n$ draws
without replacement from an urn that contains $2m$ balls (vertices) among which
$\kappa$ are of type 1 (in the planted clique) and the rest are of type 0 (outside of
the planted clique). Denote by $p_{\varepsilon'}$ the joint distribution of $\varepsilon' = (\varepsilon'_1, \ldots, \varepsilon'_n)$ and
define their "with replacement" counterparts as follows. Let $\varepsilon_1, \ldots, \varepsilon_n$ be $n$ i.i.d.
Bernoulli random variables with parameter $p = \frac{\kappa}{2m} \le \frac{1}{2}$. Denote by $p_\varepsilon$ the joint
distribution of $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$.
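The effect of replacing draws without replacement by draws with replacement can be checked numerically on the number of type-1 draws, which follows a hypergeometric law in one case and a binomial law in the other. The sketch below computes their exact total variation distance, taken here as half the $\ell_1$ distance between the two probability mass functions; the function names and the urn sizes are assumptions chosen for illustration.

```python
from math import comb

def hypergeom_pmf(N, k, n, s):
    """P(s successes in n draws without replacement; N items, k of them successes)."""
    if s > k or n - s > N - k:
        return 0.0
    return comb(k, s) * comb(N - k, n - s) / comb(N, n)

def binom_pmf(n, p, s):
    """P(s successes in n independent draws with success probability p)."""
    return comb(n, s) * p ** s * (1 - p) ** (n - s)

def tv_distance(N, k, n):
    """Half the l1 distance between H(N, k, n) and B(n, k / N)."""
    return 0.5 * sum(
        abs(hypergeom_pmf(N, k, n, s) - binom_pmf(n, k / N, s))
        for s in range(n + 1)
    )

# Same proportion of successes and same number of draws, growing urn size.
for N in (40, 400, 4000):
    print(N, tv_distance(N, N // 4, 5))
```

With the number of draws fixed, the distance shrinks as the urn grows, which is the regime $n \ll m$ exploited in the proof.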

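We also replace the distribution of the $\gamma'_j$s as follows. Let $\gamma = (\gamma_1, \ldots, \gamma_d)$ have
conditional distribution given $\varepsilon$ be given by

$$p_{\gamma|\varepsilon}(A) = P\Big( \gamma' \in A \,\Big|\, \sum_{j=1}^d \gamma'_j \ge k,\ \varepsilon' = \varepsilon \Big).$$

Define $(X_1, \ldots, X_n)$ by

$$X_i = (1 - \varepsilon_i) r_i + \varepsilon_i Y_i, \quad i = 1, \ldots, n,$$

where $Y_i \in \mathbb{R}^d$ has coordinates given by

$$Y_{i,j} = \begin{cases} \eta_i & \text{if } \gamma_j = 1, \\ r_{ij} & \text{otherwise.} \end{cases}$$

With this construction, the $X_i$s are i.i.d. Moreover, as we will see, the joint
distribution $P_1^{\mathrm{bl}(G)}$ of $\mathrm{bl}(G) = (X_1^{(G)}, \ldots, X_n^{(G)})$ is close in total variation to the joint
distribution $P_1^{\otimes n}$ of $(X_1, \ldots, X_n)$.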

Note first that Markov's inequality yields 

Moreover, given Ya=1 s i = s » we have Yli=i 7* > C/ ~ %(2m — n,n — s,n). It 
follows from [DF80], Theorem (4) that 



%(2m — n,K — s,n) — B[n 



2m — n 



4n 4n 

< < — 

TV 2m — n m 



Together with the Chernoff-Okamoto inequality [Dud99], Equation (1.3.10), it 
yields 



n(K — s) n(n — s)^ /"i\lv^ \ n An hn 

-w — -~v o — -M-) \z2 £i = s ) <- + — = — 

2m — n V 2m — n n \ i f— ' / m m m 



i=X 
Combined with (11) and in view of (10)(b, c), it implies that with probability
$1 - 6n/m$, it holds



(12) > 7j>U>- \— log(— )> >k. 

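Denote by $p$ the joint distribution of $(\varepsilon_1, \ldots, \varepsilon_n, \gamma_1, \ldots, \gamma_d)$ and by $p'$ that of
$(\varepsilon'_1, \ldots, \varepsilon'_n, \gamma'_1, \ldots, \gamma'_d)$. Using again [DF80], Theorem (4) and (10)(a), we get

$$\|p - p'\|_{TV} \le \frac{6n}{m} + \|p_{\varepsilon'} - p_\varepsilon\|_{TV} \le \frac{6n}{m} + \frac{4n}{2m} = \frac{8n}{m} \le \beta.$$

Since the conditional distribution of $(X_1, \ldots, X_n)$ given $(\varepsilon, \gamma)$ is the same as that
of $\mathrm{bl}(G)$ given $(\varepsilon', \gamma')$, we have

$$\big\| P_1^{\mathrm{bl}(G)} - P_1^{\otimes n} \big\|_{TV} = \|p' - p\|_{TV} \le \beta.$$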

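It remains to prove that $P_1 \in \mathcal{D}_1^k(\bar\theta)$. Fix $\nu > 0$ and define $Z \in B_0(k)$ by

$$Z_j = \begin{cases} \gamma_j / \sqrt{k} & \text{if } \sum_{i=1}^{j} \gamma_i \le k, \\ 0 & \text{otherwise.} \end{cases}$$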

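Denote by $S_Z \subset \{1, \ldots, d\}$ the support of $Z$. Next, observe that for any $x, \theta > 0$,
it holds

$$\inf_{v \in B_0(k)} P_1^{\otimes n}\big( \hat V_n(v) - (1 + \theta) < -x \big) \le P_1^{\otimes n}\big( \hat V_n(Z) - (1 + \theta) < -x \big). \tag{13}$$

Moreover, for any $i = 1, \ldots, n$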

(Z J Xi) 2 = l{ke iVi + (1 - e{) J2 nj) 2 = £ik + (1 - e<)^( £> 

Therefore, since $Z$ is independent of the $r_{ij}$s, the following equality holds in
distribution:



{Z r Xi) 2 dist. ! + ei(fc _ 1} + ?(l£i) £ w . 



£=1 



where $\omega_{i,\ell}$, $i, \ell \ge 1$, is a sequence of i.i.d. Rademacher random variables that are
independent of the $\varepsilon_i$s. Note that by Hoeffding's inequality, it holds with probability
at least $1 - \nu/2$,






log(2/i/) 



■;? 



Moreover, it follows from the Chernoff-Okamoto inequality [Dud99], Equation (1.3.10), that with probability at least $1 - \nu/2$, it holds

$$\frac{k-1}{n} \sum_{i=1}^n \varepsilon_i \ge \frac{k-1}{n}\, np - \frac{k-1}{n} \sqrt{2np \log(2/\nu)} .$$
Put together, the above two displays imply that with probability $1 - \nu$, it holds



9 n (z) > i + VLiVn - k -^±J™i og (2 M - 41 / log(2/ ^ 



2m re V m V re 



> i + ( fc - X ) K _ L k {k^l)^]og{2M _ 4 /Iog(2/i/) 



2rre V 2rre re V re 



1 + 9 -,/ 2M -!2|PM_ 4 /log(2M 



n V re 

Together with (13), this completes the proof. □ 

Define $N = \lceil 40/\delta \rceil$. Assume first that $k \ge M n^{\frac{1}{4-\alpha}}$, where $M > 0$ is a constant to be chosen large enough (see below). Take $\kappa = \max(8, M \log(N))\, N k$ and $m = Nn$. It implies that

- (jfe-l)K M/c 2 1 /jfe" 



2rre 4re 4M 1 ~t V n 

Moreover, under these conditions, it is easy to check that (10) is satisfied with $\beta = \delta/5$, and we are therefore in a position to apply Lemma 8. It implies that there exists $P_1 \in \mathcal{D}_1^k(\bar\theta)$ such that $\|P_1^{\mathrm{bl}(G)} - P_1^{\otimes n}\|_{TV} \le \delta/5$.

Assume now that $k < M n^{\frac{1}{4-\alpha}}$. Take $m, \kappa \ge 2$ to be the largest integers such that

rre < 27V(nfc 2 ~ a )^ r« < (2m)a . 

Note that Tk > (2m) 2 . Let us now check condition (10). It holds, for M large 
enough, 

( a ) IH > ^ ( n l+(2-a)M) 2h = N > 40/5. 
re re 

n«; 1 /re M 2 1 /m 

(6) > r\To — > rn^>161og( — 

m 2T(4Np V k 2 a 2T(4N)2 ^n 

, . n« 1 / re M 2 ~? , 

C) > r-J-: > j-k>8k. 

m "2r(4A^)l V k 2 -« 2T(4N)2 ~ 

Under these conditions, (10) is satisfied with $\beta = \delta/5$ and we are therefore in a position to apply Lemma 8. It implies that there exists $P_1 \in \mathcal{D}_1^k(\bar\theta)$ such

that ||P^ I(G) - Pf II < 8/5, where := { -^ > —^JE, taking L = 

II II IV Am 8 r(47V)2 V n 

min ( 4Afi-i i — b) > yields that Pi G V\(L6 a ) for any (d,n,k) G R^. More- 
over, 

$$P_0^{(G)}(\psi \circ \mathrm{bl}(G) = 1) \vee P_1^{(G)}(\psi \circ \mathrm{bl}(G) = 0) \le P_0^{\otimes n}(\psi = 1) \vee P_1^{\otimes n}(\psi = 0) + \delta/5 .$$

□

Theorems 5 and 7 imply the following result. 
COROLLARY 9. Fix $\alpha \in [1, 2)$, $\mu \in (0, \frac{1}{4-\alpha})$. Conditionally on Hypothesis $A_{PC}$, the optimal rate of detection $\theta^\circ$ over the class of randomized polynomial time tests satisfies

M<e°<J^, (d,n,k)£R,. 
V n V n 

Proof. Let $\mathcal{T}$ denote the class of randomized polynomial time tests. Since $\mathrm{bl}$ can be computed in randomized polynomial time, $\psi \in \mathcal{T}$ implies that $\xi = \psi \circ \mathrm{bl} \in \mathcal{T}$. Therefore, for all $(d, n, k) \in R_\mu$,

inf P® n (V = l)VPf n (V = 0) > inf P[, G) (£(G) = l)VPf } (£(G) = 0)-0.2<5 = 5. 

where the last inequality follows from Hypothesis $A_{PC}$ with $a, b$ as in (9). Therefore $\theta^\circ \ge \theta_\alpha$. The upper bound follows from Theorem 5. □

The gap between $\theta^\circ$ and $\theta^*$ in Corollary 4 indicates that the price to pay for using randomized polynomial time tests for the sparse detection problem is essentially of order $\sqrt{k}$.
APPENDIX A: TECHNICAL LEMMAS

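Lemma 10. For all $P_0 \in \mathcal{D}_0$ and $t > 0$, it holds

$$P_0\Big( \lambda_{\max}^k(\hat\Sigma) \ge 1 + 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \Big) \le \Big( \frac{9ed}{k} \Big)^k e^{-t} .$$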

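PROOF. We define the following events, for all $S \subset \{1, \ldots, d\}$, $u \in \mathbb{R}^d$, and $t > 0$:

$$\mathcal{A} = \Big\{ \lambda_{\max}^k(\hat\Sigma) \ge 1 + 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \Big\},$$

$$\mathcal{A}_S = \Big\{ \lambda_{\max}(\hat\Sigma_S) \ge 1 + 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \Big\},$$

$$\mathcal{A}_u = \Big\{ u^\top \hat\Sigma u \ge 1 + 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \Big\}.$$

Taking the union over all sets of cardinality $k$, it holds

$$\mathcal{A} \subset \bigcup_{|S| = k} \mathcal{A}_S .$$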
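Furthermore, let $\mathcal{N}_S$ be a minimal covering $1/4$-net of $\mathcal{S}^S$, the set of unit vectors with support included in $S$. It is a classical result that $|\mathcal{N}_S| \le 9^k$, as shown in [Ver10], and that it holds

$$\lambda_{\max}(\hat\Sigma_S - I_S) \le 2 \max_{u \in \mathcal{N}_S} u^\top (\hat\Sigma - I_d) u .$$

Therefore it holds

$$\mathcal{A}_S \subset \bigcup_{u \in \mathcal{N}_S} \mathcal{A}_u .$$

Hence, by a union bound,

$$P_0(\mathcal{A}) \le \sum_{|S| = k} \sum_{u \in \mathcal{N}_S} P_0(\mathcal{A}_u) .$$

By definition of $\mathcal{D}_0$, $P_0(\mathcal{A}_u) \le e^{-t}$ for $|u|_2 = 1$. The classical inequality $\binom{d}{k} \le \big( \frac{ed}{k} \big)^k$ yields the desired result. □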
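The counting step in this proof (number of supports times net size, bounded through the classical binomial inequality) can be sanity-checked numerically. The sketch below is only an illustration: the helper names and the values $d = 500$, $k = 10$ are assumptions, and everything is computed on the log scale to avoid overflow.

```python
from math import comb, e, exp, log


def log_pair_count(d, k):
    """log of C(d, k) * 9**k, the number of (support, net point) pairs."""
    return log(comb(d, k)) + k * log(9)


def log_lemma10_prefactor(d, k):
    """log of (9 e d / k)**k, the prefactor in the tail bound."""
    return k * log(9 * e * d / k)


def lemma10_tail_bound(d, k, t):
    """Right-hand side (9 e d / k)**k * exp(-t), capped at 1."""
    return min(1.0, exp(log_lemma10_prefactor(d, k) - t))


# Hypothetical dimensions (assumptions for illustration).
d, k = 500, 10
# Choose t so that the tail bound equals 1/20.
t = log_lemma10_prefactor(d, k) + log(20)
print(log_pair_count(d, k), log_lemma10_prefactor(d, k), lemma10_tail_bound(d, k, t))
```

Solving the bound for $t$ in this way shows that a deviation level of order $k \log(9ed/k)$ is what the union bound costs.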

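Lemma 11. For all $P_0 \in \mathcal{D}_0$ and $\delta > 0$, it holds

$$P_0\Big( \mathrm{SDP}_k(\hat\Sigma) \le 1 + 2\sqrt{\frac{k^2 \log(4d^2/\delta)}{n}} + 2\frac{k \log(4d^2/\delta)}{n} + 2\sqrt{\frac{\log(2d/\delta)}{n}} + 2\frac{\log(2d/\delta)}{n} \Big) \ge 1 - \delta .$$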

Proof. We decompose $\hat\Sigma$ as the sum of its diagonal and off-diagonal matrices, respectively $\Delta$ and $\Psi$. Taking $U = -\Psi$ in the dual formulation of the semidefinite program [BAd10, BR12] yields

$$(14) \qquad \mathrm{SDP}_k(\hat\Sigma) = \min_U \big\{ \lambda_{\max}(\hat\Sigma + U) + k |U|_\infty \big\} \le |\Delta|_\infty + k |\Psi|_\infty .$$

We first control the largest off-diagonal element of $\hat\Sigma$ by bounding $|\Psi|_\infty$ with high probability. For every $i \ne j$, we have

1 rl , n 



1-3 

Z L // 

1 



£=1 £=1 

-, n T i T n -. ti 



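By definition of $\mathcal{D}_0$, it holds for $t > 0$ that

$$P_0\Big( |\Psi_{ij}| \ge 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \Big) \le 4 e^{-t} .$$

Hence, by a union bound on the off-diagonal terms, we get

$$P_0\Big( \max_{i < j} |\Psi_{ij}| \ge 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \Big) \le 2 d^2 e^{-t} .$$

Taking $t = \log(4d^2/\delta)$ yields that under $P_0$, with probability $1 - \delta/2$,

$$(15) \qquad |\Psi|_\infty \le 2\sqrt{\frac{\log(4d^2/\delta)}{n}} + 2\frac{\log(4d^2/\delta)}{n} .$$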
We control the largest diagonal element of $\hat\Sigma$ as follows. We have by definition of $\Delta$, for all $i$,



1 



n 



Similarly, by a union bound over the $d$ diagonal terms, it holds

$$P_0\Big( |\Delta|_\infty \ge 1 + 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \Big) \le d e^{-t} .$$

Taking $t = \log(2d/\delta)$ yields, under $P_0$ with probability $1 - \delta/2$,

$$(16) \qquad |\Delta|_\infty \le 1 + 2\sqrt{\frac{\log(2d/\delta)}{n}} + 2\frac{\log(2d/\delta)}{n} .$$

The desired result is obtained by plugging (15) and (16) into (14). □
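To make the calibration in this proof concrete, the following minimal sketch evaluates the high-probability bound of Lemma 11 and checks that the two choices of $t$ split the failure probability evenly. The function names and the values of $(n, d, k, \delta)$ are hypothetical, chosen only for illustration; the ambient dimension is written `d` throughout.

```python
from math import exp, log, sqrt


def deviation(t, n):
    """The deviation term 2*sqrt(t/n) + 2*t/n used for each entry."""
    return 2 * sqrt(t / n) + 2 * t / n


def sdp_null_threshold(n, d, k, delta):
    """Bound on SDP_k under the null, from |Delta|_inf + k * |Psi|_inf."""
    t_off = log(4 * d**2 / delta)   # level used for the off-diagonal entries
    t_diag = log(2 * d / delta)     # level used for the diagonal entries
    return 1 + deviation(t_diag, n) + k * deviation(t_off, n)


def failure_probability(d, delta):
    """Union-bound budget: 2 d^2 e^{-t_off} + d e^{-t_diag}."""
    t_off = log(4 * d**2 / delta)
    t_diag = log(2 * d / delta)
    return 2 * d**2 * exp(-t_off) + d * exp(-t_diag)


# Hypothetical problem sizes (assumptions for illustration).
n, d, k, delta = 2000, 500, 10, 0.05
print(sdp_null_threshold(n, d, k, delta), failure_probability(d, delta))
```

The off-diagonal term is multiplied by $k$, so for moderately large $k$ it dominates the threshold; this is where the $\sqrt{k^2 \log(d)/n}$ scaling of the bound comes from.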